389 research outputs found

Factors Influencing the Surprising Instability of Word Embeddings

Author: Kummerfeld Jonathan K.
Mihalcea Rada
Wendlandt Laura
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2018

Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this paper, we consider one aspect of embedding spaces, namely their stability. We show that even relatively high frequency words (100-200 occurrences) are often unstable. We provide empirical evidence for how various factors contribute to the stability of word embeddings, and we analyze the effects of stability on downstream tasks. Comment: NAACL HLT 201

arXiv.org e-Print Archive

Crossref

Text-to-text semantic similarity for automatic short answer grading

Author: Michael Mohler
Rada Mihalcea
Publication date: 01/01/2009

In this paper, we explore unsupervised techniques for the task of automatic short answer grading. We compare a number of knowledge-based and corpus-based measures of text similarity, evaluate the effect of domain and size on the corpus-based measures, and also introduce a novel technique to improve the performance of the system by integrating automatic feedback from the student answers. Overall, our system significantly and consistently outperforms other unsupervised methods for short answer grading that have been proposed in the past.

CiteSeerX

Crossref

Babylon parallel text builder: Gathering parallel texts for low-density languages

Author: Michael Mohler
Rada Mihalcea
Publication date: 01/01/2008

This paper describes BABYLON, a system that attempts to overcome the shortage of parallel texts in low-density languages by supplementing existing parallel texts with texts gathered automatically from the Web. In addition to the identification of entire Web pages, we also propose a new feature specifically designed to find parallel text chunks within a single document. Experiments carried out on the Quechua-Spanish language pair show that the system is successful in automatically identifying a significant amount of parallel texts on the Web. Evaluations of a machine translation system trained on this corpus indicate that the Web-gathered parallel texts can supplement manually compiled parallel texts and perform significantly better than the manually compiled texts when tested on other Web-gathered data.

CiteSeerX